{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datawhale 智慧海洋建设-Task5 模型融合" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 学习目标" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "学习融合策略\n", "\n", "完成相应学习打卡任务" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 内容介绍" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://mlwave.com/kaggle-ensembling-guide/ \n", "https://github.com/MLWave/Kaggle-Ensemble-Guide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "模型融合是比赛后期一个重要的环节,大体来说有如下的类型方式。\n", "\n", "1. 简单加权融合:\n", " - 回归(分类概率):算术平均融合(Arithmetic mean),几何平均融合(Geometric mean);\n", " - 分类:投票(Voting)\n", "\n", "\n", "2. boosting/bagging(在xgboost,Adaboost,GBDT中已经用到):\n", " - 多树的提升方法\n", " \n", " \n", "3. stacking/blending:\n", " - 构建多层模型,并利用预测结果再拟合预测。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3 相关理论介绍" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.1 简单加权融合" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**平均法-Averaging**\n", "\n", "1. 对于回归问题,一个简单直接的思路是取平均。将多个模型的回归结果取平均值作为最终预测结果,进而把多个弱分类器荣和城强分类器。\n", "\n", "2. 稍稍改进的方法是进行加权平均,权值可以用排序的方法确定,举个例子,比如A、B、C三种基本模型,模型效果进行排名,假设排名分别是1,2,3,那么给这三个模型赋予的权值分别是3/6、2/6、1/6。\n", "\n", "3. 平均法或加权平均法看似简单,其实后面的高级算法也可以说是基于此而产生的,Bagging或者Boosting都是一种把许多弱分类器这样融合成强分类器的思想。\n", "\n", "4. Averaging也可以用于对分类问题的概率进行平均。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**投票法-voting**\n", "\n", "1. 对于一个二分类问题,有3个基础模型,现在我们可以在这些基学习器的基础上得到一个投票的分类器,把票数最多的类作为我们要预测的类别。\n", "\n", "2. 投票法有硬投票(hard voting)和软投票(soft voting)\n", "\n", "3. 硬投票: 对多个模型直接进行投票,不区分模型结果的相对重要度,最终投票数最多的类为最终被预测的类。\n", "\n", "4. 软投票:增加了设置权重的功能,可以为不同模型设置不同权重,进而区别模型不同的重要度。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.2 stacking/blending" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 堆叠法-stacking " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**基本思想**:用初始训练数据学习出若干个基学习器后,将这几个学习器的预测结果作为新的训练集(第一层),来学习一个新的学习器(第二层)。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**背景**: 为了帮助大家理解模型的原理,我们先假定一下数据背景。\n", "1. 训练集数据大小为`10000*100`,测试集大小为`3000*100`。即训练集有10000条数据、100个特征;测试集有3000条数据、100个特征。该数据对应**回归问题**。\n", "\n", "2. 第一层使用三种算法-XGB、LGB、NN。第二层使用GBDT。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**算法解读**\n", "1. **stacking 第一层**\n", "\n", " 1. XGB算法 - 对应图中`model 1`部分\n", " - 输入:使用训练集进行5-fold处理\n", " - 处理:具体处理细节如下\n", " - 使用1、2、3、4折作为训练集,训练一个XGB模型并预测第5折和测试集,将预测结果分别称为**XGB-pred-tran5**(shape `2000*1`)和**XGB-pred-test1**(shape `3000*1`).\n", " - 使用1、2、3、5折作为训练集,训练一个XGB模型并预测第4折和测试集,将预测结果分别称为**XGB-pred-tran4**(shape `2000*1`)和**XGB-pred-test2**(shape `3000*1`).\n", " - 使用1、2、4、5折作为训练集,训练一个XGB模型并预测第3折和测试集,将预测结果分别称为**XGB-pred-tran3**(shape `2000*1`)和**XGB-pred-test3**(shape `3000*1`).\n", " - 使用1、3、4、5折作为训练集,训练一个XGB模型并预测第2折和测试集,将预测结果分别称为**XGB-pred-tran2**(shape `2000*1`)和**XGB-pred-test4**(shape `3000*1`).\n", " - 使用2、3、4、5折作为训练集,训练一个XGB模型并预测第1折和测试集,将预测结果分别称为**XGB-pred-tran1**(shape `2000*1`)和**XGB-pred-test5**(shape `3000*1`).\n", " - 输出:\n", " - 将XGB分别对1、2、3、4、5折进行预测的结果合并,得到**XGB-pred-tran**(shape `10000*1`)。并且根据5-fold的原理可以知道,与原数据可以形成对应关系。因此在图中称为NEW FEATURE。\n", " - 将XGB-pred-test1 - 5 的结果使用Averaging的方法求平均值,最终得到**XGB-pred-test**(shape `3000*1`)。\n", " \n", " 2. 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Voting**\n",
    "\n",
    "1. For a binary classification problem with three base models, we can build a voting classifier on top of the base learners and predict the class that receives the most votes.\n",
    "\n",
    "2. Voting comes in two flavors: hard voting and soft voting; a sketch of both follows.\n",
    "\n",
    "3. Hard voting: the models vote directly and every vote counts equally; the class with the most votes is the final prediction.\n",
    "\n",
    "4. Soft voting: adds support for weights, so different models can be given different importance; the prediction is taken from the weighted average of the predicted class probabilities.\n"
   ]
  },
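  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of hard vs. soft voting with scikit-learn's `VotingClassifier`; the toy data and the 2:1:1 weights are made up for illustration (the hands-on section below applies hard voting to the competition data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "X, y = make_classification(n_samples=500, n_features=10, random_state=0)\n",
    "\n",
    "estimators = [('lr', LogisticRegression(max_iter=1000)),\n",
    "              ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),\n",
    "              ('nb', GaussianNB())]\n",
    "\n",
    "# Hard voting: majority class wins, every model counts equally.\n",
    "hard = VotingClassifier(estimators=estimators, voting='hard')\n",
    "\n",
    "# Soft voting: average the predicted probabilities, here with arbitrary\n",
    "# weights 2:1:1 giving the (hypothetically stronger) first model more say.\n",
    "soft = VotingClassifier(estimators=estimators, voting='soft', weights=[2, 1, 1])\n",
    "\n",
    "for clf in (hard, soft):\n",
    "    clf.fit(X, y)\n",
    "    print(clf.voting, clf.score(X, y))"
   ]
  },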
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3.2 Stacking/Blending"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stacking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**Basic idea**: train several base learners on the original training data, then use their predictions as a new training set (level 1) on which a new learner (level 2) is trained.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Setting**: to make the mechanics concrete, assume the following data.\n",
    "1. The training set has shape `10000*100` and the test set has shape `3000*100`, i.e. the training set has 10000 rows and the test set 3000 rows, each with 100 features. The task is a **regression** problem.\n",
    "\n",
    "2. Level 1 uses three algorithms - XGB, LGB, and NN. Level 2 uses GBDT."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Walkthrough**\n",
    "1. **Stacking, level 1**\n",
    "\n",
    "    1. XGB - the `model 1` part of the figure\n",
    "        - Input: split the training set with 5-fold CV.\n",
    "        - Procedure:\n",
    "            - Train an XGB model on folds 1, 2, 3, 4 and predict fold 5 and the test set; call the predictions **XGB-pred-tran5** (shape `2000*1`) and **XGB-pred-test1** (shape `3000*1`).\n",
    "            - Train an XGB model on folds 1, 2, 3, 5 and predict fold 4 and the test set; call the predictions **XGB-pred-tran4** (shape `2000*1`) and **XGB-pred-test2** (shape `3000*1`).\n",
    "            - Train an XGB model on folds 1, 2, 4, 5 and predict fold 3 and the test set; call the predictions **XGB-pred-tran3** (shape `2000*1`) and **XGB-pred-test3** (shape `3000*1`).\n",
    "            - Train an XGB model on folds 1, 3, 4, 5 and predict fold 2 and the test set; call the predictions **XGB-pred-tran2** (shape `2000*1`) and **XGB-pred-test4** (shape `3000*1`).\n",
    "            - Train an XGB model on folds 2, 3, 4, 5 and predict fold 1 and the test set; call the predictions **XGB-pred-tran1** (shape `2000*1`) and **XGB-pred-test5** (shape `3000*1`).\n",
    "        - Output:\n",
    "            - Concatenate the out-of-fold predictions for folds 1-5 into **XGB-pred-tran** (shape `10000*1`). By construction of the 5-fold split, each row corresponds to a row of the original training set, which is why the figure labels it NEW FEATURE.\n",
    "            - Average XGB-pred-test1 through XGB-pred-test5 to obtain **XGB-pred-test** (shape `3000*1`).\n",
    "\n",
    "    2. LGB - also the `model 1` part of the figure\n",
    "        - Input: same as for XGB.\n",
    "        - Procedure: same as for XGB; only the result names change, e.g. **LGB-pred-tran5** and **LGB-pred-test1**.\n",
    "        - Output:\n",
    "            - Concatenate the out-of-fold predictions for folds 1-5 into **LGB-pred-tran** (shape `10000*1`).\n",
    "            - Average LGB-pred-test1 through LGB-pred-test5 to obtain **LGB-pred-test** (shape `3000*1`).\n",
    "\n",
    "    3. NN - also the `model 1` part of the figure\n",
    "        - Input: same as for XGB.\n",
    "        - Procedure: same as for XGB; only the result names change, e.g. **NN-pred-tran5** and **NN-pred-test1**.\n",
    "        - Output:\n",
    "            - Concatenate the out-of-fold predictions for folds 1-5 into **NN-pred-tran** (shape `10000*1`).\n",
    "            - Average NN-pred-test1 through NN-pred-test5 to obtain **NN-pred-test** (shape `3000*1`).\n",
    "\n",
    "2. **Stacking, level 2**\n",
    "    - Training set: stack the three new features **XGB-pred-tran**, **LGB-pred-tran**, and **NN-pred-tran** column-wise into a new training set (shape `10000*3`).\n",
    "    - Test set: stack the three new test predictions **XGB-pred-test**, **LGB-pred-test**, and **NN-pred-test** into a new test set (shape `3000*3`).\n",
    "    - Train the level-2 predictor, the GBDT model, on the new training set and predict the new test set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![stacking diagram](https://img-blog.csdnimg.cn/20210401090352724.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDU4NTgzOQ==,size_16,color_FFFFFF,t_70)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Blending"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Blending is broadly similar to stacking. The main difference is that blending does not generate the level-2 features with a k-fold CV strategy; it uses a holdout set instead. Put simply, blending trains the different levels on disjoint subsets of the data.\n",
    "\n",
    "Using the same data as above, a two-level blending model is built as follows (a code sketch is given after this description).\n",
    "\n",
    "First split the training set into two parts (d1, d2), e.g. d1 with 4000 rows for blending's level 1 and d2 with 6000 rows for blending's level 2.\n",
    "\n",
    "Level 1: train several models on d1 and use their predictions on d2 and on the test set as the New Features of level 2. With the same three models as above, this produces a `6000*3` feature matrix for d2 and a `3000*3` feature matrix for the test set.\n",
    "\n",
    "Level 2: train a new model on d2's New Features and labels, then feed the test set's New Features into it as the final test input; its predictions on the test set are the final fused result.\n"
   ]
  },
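  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal blending sketch under the assumptions above, with random arrays standing in for the `10000*100` training and `3000*100` test sets and generic sklearn regressors standing in for XGB/LGB/NN:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor\n",
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Toy stand-ins for the 10000x100 training set and the 3000x100 test set.\n",
    "rng = np.random.RandomState(0)\n",
    "X, y = rng.randn(10000, 100), rng.randn(10000)\n",
    "X_test = rng.randn(3000, 100)\n",
    "\n",
    "# Holdout split: d1 (4000 rows) trains level 1, d2 (6000 rows) trains the blender.\n",
    "X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.6, random_state=0)\n",
    "\n",
    "# Three level-1 models (stand-ins for XGB/LGB/NN).\n",
    "level1 = [Ridge(),\n",
    "          RandomForestRegressor(n_estimators=50, random_state=0),\n",
    "          GradientBoostingRegressor(random_state=0)]\n",
    "\n",
    "# Level 1: fit on d1; the predictions on d2 and test become the New Features.\n",
    "d2_feats = np.column_stack([m.fit(X_d1, y_d1).predict(X_d2) for m in level1])  # 6000x3\n",
    "test_feats = np.column_stack([m.predict(X_test) for m in level1])              # 3000x3\n",
    "\n",
    "# Level 2: fit the blender on d2's New Features, then predict the test set.\n",
    "blender = Ridge()\n",
    "blender.fit(d2_feats, y_d2)\n",
    "final_pred = blender.predict(test_feats)  # shape (3000,)"
   ]
  },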
"stdout", "output_type": "stream", "text": [ "Memory usage of dataframe is 30.28 MB\n", "Memory usage after optimization is: 7.59 MB\n", "Decreased by 74.9%\n" ] } ], "source": [ "all_df = pd.read_csv('data/group_df.csv',index_col=0)\n", "all_df = reduce_mem_usage(all_df)\n", "all_df = all_df.fillna(99)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9000, 440)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_df.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 2 4361\n", "-1 2000\n", " 0 1621\n", " 1 1018\n", "Name: label, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_df['label'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "all_df中label为0/1/2的为训练集,一共有7000条;label为-1的为测试集,一共有2000条。\n", "1. label为-1的测试集没有label,这部分数据用于模拟真实比赛提交数据。\n", "\n", "2. train数据均有标签,我们将从中分出30%作为验证集,其余作为训练集。在验证集上比较模型性能优劣,模型性能均使用f1作为评分。\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train = all_df[all_df['label'] != -1]\n", "test = all_df[all_df['label'] == -1]\n", "feats = [c for c in train.columns if c not in ['ID', 'label']]\n", "\n", "# 根据7:3划分训练集和测试集\n", "X_train,X_val,y_train,y_val= train_test_split(train[feats],train['label'],test_size=0.3,random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4.2 单模及加权融合" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里训练三个单模,分别是用了一个三种不同的RF/LGB/LGB模型。事实上模型融合需要基础分类器之间存在差异,一般不会选用相同的分类器模型。这里只是作为展示。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# 单模函数\n", "def build_model_rf(X_train,y_train):\n", " model = RandomForestClassifier(n_estimators = 100)\n", " model.fit(X_train, y_train)\n", " return model\n", "\n", "\n", "def build_model_lgb(X_train,y_train):\n", " model = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n", " model.fit(X_train, y_train)\n", " return model\n", "\n", "\n", "def build_model_lgb2(X_train,y_train):\n", " model = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n", " model.fit(X_train, y_train)\n", " return model\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predict rf ...\n", "0.8987051046527208\n", "predict lgb...\n", "0.9144414270113281\n", "predict lgb 2...\n", "0.9183965870229657\n" ] } ], "source": [ "# 这里针对三个单模进行训练,其中subA_rf/lgb/nn都是可以提交的模型\n", "# 单模没有进行调参,因此是弱分类器,效果可能不是很好。\n", "\n", "print('predict rf ...')\n", "model_rf = build_model_rf(X_train,y_train)\n", "val_rf = model_rf.predict(X_val)\n", "subA_rf = model_rf.predict(test[feats])\n", "rf_f1_score = f1_score(y_val,val_rf,average='macro')\n", "print(rf_f1_score)\n", "\n", "print('predict lgb...')\n", "model_lgb = build_model_lgb(X_train,y_train)\n", "val_lgb = model_lgb.predict(X_val)\n", "subA_lgb = model_lgb.predict(test[feats])\n", "lgb_f1_score = f1_score(y_val,val_lgb,average='macro')\n", "print(lgb_f1_score)\n", "\n", "\n", "print('predict lgb 2...')\n", "model_lgb2 = build_model_lgb2(X_train,y_train)\n", "val_lgb2 = model_lgb2.predict(X_val)\n", "subA_lgb2 = model_lgb2.predict(test[feats])\n", "lgb2_f1_score = f1_score(y_val,val_lgb2,average='macro')\n", "print(lgb2_f1_score)\n" ] }, { "cell_type": "code", "execution_count": 10, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9142736444973326\n" ] } ], "source": [ "voting_clf = VotingClassifier(estimators=[('rf',model_rf ),\n", " ('lgb',model_lgb),\n", " ('lgb2',model_lgb2 )],voting='hard')\n", "\n", "voting_clf.fit(X_train,y_train)\n", "val_voting = voting_clf.predict(X_val)\n", "subA_voting = voting_clf.predict(test[feats])\n", "voting_f1_score = f1_score(y_val,val_voting,average='macro')\n", "print(voting_f1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4.3 Stacking融合" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "_N_FOLDS = 5 # 采用5折交叉验证\n", "kf = KFold(n_splits=_N_FOLDS, random_state=42) # sklearn的交叉验证模块,用于划分数据\n", "\n", "\n", "def get_oof(clf, X_train, y_train, X_test):\n", " oof_train = np.zeros((X_train.shape[0], 1)) \n", " oof_test_skf = np.empty((_N_FOLDS, X_test.shape[0], 1)) \n", " \n", " for i, (train_index, test_index) in enumerate(kf.split(X_train)): # 交叉验证划分此时的训练集和验证集\n", " kf_X_train = X_train.iloc[train_index,]\n", " kf_y_train = y_train.iloc[train_index,]\n", " kf_X_val = X_train.iloc[test_index,]\n", " \n", " clf.fit(kf_X_train, kf_y_train)\n", " \n", " oof_train[test_index] = clf.predict(kf_X_val).reshape(-1, 1) \n", " oof_test_skf[i, :] = clf.predict(X_test).reshape(-1, 1) \n", " \n", " oof_test = oof_test_skf.mean(axis=0) # 对每一则交叉验证的结果取平均\n", " return oof_train, oof_test # 返回当前分类器对训练集和测试集的预测结果" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# 将你的每个分类器都调用get_oof函数,并把它们的结果合并,就得到了新的训练和测试数据new_train,new_test\n", "new_train, new_test = [], []\n", "\n", "\n", "model1 = RandomForestClassifier(n_estimators = 100)\n", "model2 = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n", "model3 = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n", "\n", "for clf in [model1, model2, model3]:\n", " oof_train, oof_test = get_oof(clf, X_train, y_train, X_val)\n", " new_train.append(oof_train)\n", " new_test.append(oof_test)\n", " \n", "new_train = np.concatenate(new_train, axis=1)\n", "new_test = np.concatenate(new_test, axis=1)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8816601744239989\n" ] } ], "source": [ "# 用新的训练数据new_train作为新的模型的输入,stacking第二层\n", "# 使用LogisticRegression作为第二层是为了防止模型过拟合\n", "# 这里使用的模型还有待优化,因此模型融合效果并不是很好\n", "clf = LogisticRegression()\n", "clf.fit(new_train, y_train)\n", "result = clf.predict(new_test)\n", "\n", "stacking_f1_score = f1_score(y_val,result,average='macro')\n", "print(stacking_f1_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5 思考题" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 如何基于stacking改进出blending - stacking使用了foldCV,blending使用了holdout.\n", "\n", "2. 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5 Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. How would you turn the stacking code into blending? Stacking builds its level-2 features with k-fold CV, while blending uses a holdout split.\n",
    "\n",
    "2. What could be changed to push the stacking F1 score higher? Consider, for instance, the number of level-1 models and their diversity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**References**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://blog.csdn.net/weixin_44585839/article/details/110148396\n",
    "\n",
    "https://blog.csdn.net/weixin_39962758/article/details/111101263"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**END.**\n",
    "\n",
    "[Zhang Jin: Datawhale member and algorithm-competition enthusiast. CSDN: https://blog.csdn.net/weixin_44585839/]\n",
    "\n",
    "\n",
    "\n",
    "About Datawhale:\n",
    "\n",
    "> Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, gathering a team with an open-source, exploratory spirit. With the vision of \"for the learner - growing together with learners\", Datawhale encourages people to show their true selves, to be open and inclusive, to trust and help each other, to dare to try and fail, and to take responsibility. Datawhale also applies the open-source philosophy to open-source content, open-source learning, and open-source solutions, empowering talent development, helping learners grow, and building connections between people, between people and knowledge, between people and companies, and between people and the future.\n",
    "\n",
    "The topics of this data-mining learning path are shared on Tianchi; follow Datawhale for details:\n",
    "\n",
    "![logo.png](https://img-blog.csdnimg.cn/2020090509294089.png)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}